Supporting Data Provenance in Data-Intensive Scalable Computing Systems

Authors

  • Matteo Interlandi
  • Tyson Condie
Abstract

Debugging data processing logic in Data-Intensive Scalable Computing (DISC) systems is difficult and time-consuming. Data provenance support is a key building block in libraries that aim to provide debugging support for data processing pipelines. In this paper we report our experience in building Titian, a data provenance system targeting the Apache Spark framework. Our focus here is to analyze the design choices and trade-offs that we and others made. Ultimately, we believe there is still more work to do before data provenance reaches widespread adoption outside the research community.

Similar resources

Data Replication-Based Scheduling in Cloud Computing Environment

High-performance computing and vast storage are two key factors required for executing data-intensive applications. Compared with traditional distributed systems such as data grids, cloud computing provides these factors on a more affordable, scalable and elastic platform. Furthermore, accessing data files is critical for performing such applications. Sometimes accessing data becomes...

Provenance in DISC Systems: Reducing Space Overhead at Runtime

Data-intensive scalable computing (DISC) systems, such as Apache Hadoop or Spark, allow processing of large amounts of heterogeneous data. For varying provenance applications, emerging provenance solutions for DISC systems track all source data items through each processing step, imposing a high space and time overhead during program execution. We introduce a provenance collection approach that red...

Supporting Large Scale Data-Intensive Computing with the FusionFS Distributed File System

State-of-the-art yet decades-old architecture of HPC storage systems has segregated compute and storage resources, bringing unprecedented inefficiencies and bottlenecks at petascale levels and beyond. This paper presents FusionFS, a new distributed file system designed from the ground up for high scalability (16K nodes) while achieving significantly higher I/O performance (2.5TB/sec). FusionFS ...

The Case for Fine-Grained Stream Provenance

The current state of the art for provenance in data stream management systems (DSMS) is to provide provenance at a high level of abstraction (such as the sensors in a sensor network from which an aggregated value is derived). This limitation was imposed by high-throughput requirements and an anticipated lack of application demand for more detailed provenance information. In this work, we firs...

Editorial: Scientific Workflows, Provenance and Their Applications

Scientific workflows play a crucial role in modern eScience [5] where many significant scientific discoveries are achieved through complex and distributed computations. For many scientists in the Life Sciences, in bioinformatics, geosciences, chemistry, physics, and numerous other domains, scientific workflows have become an enabling technology to formalize and automate complex and data intensi...

Journal:
  • IEEE Data Eng. Bull.

Volume 41

Publication date: 2018